In previous lessons, we focused on elementwise operations (like a basic ReLU on a matrix). These are memory-bound because the GPU spends more time moving data from HBM to registers than performing math.
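To see why elementwise ops are memory-bound, we can count FLOPs per byte moved. The sketch below is illustrative (the helper name `relu_arithmetic_intensity` is ours, not a standard API): ReLU does roughly one op per element but must read and write every element.

```python
def relu_arithmetic_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for elementwise ReLU on an n-element fp32 tensor:
    ~1 op per element, one read from HBM and one write back."""
    flops = n                                  # one max(x, 0) per element
    bytes_moved = 2 * n * bytes_per_elem       # read input, write output
    return flops / bytes_moved

# The ratio is a constant 1/8 FLOP per byte, no matter how large the
# tensor gets -- the memory system, not the ALUs, sets the speed limit.
print(relu_arithmetic_intensity(10**6))
```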
1. Why GEMM is Central
General Matrix Multiplication (GEMM) performs $O(N^3)$ floating-point operations on only $O(N^2)$ data, so its arithmetic intensity grows linearly with $N$. This lets us hide memory latency behind massive arithmetic throughput, making GEMM the "heartbeat" of LLMs.
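The contrast with the elementwise case can be made concrete. A minimal sketch, assuming the ideal case where each matrix crosses the memory bus once (the helper name is illustrative):

```python
def gemm_arithmetic_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for a square N x N x N matmul C = A @ B, assuming
    each of A, B, C crosses the HBM bus exactly once (perfect reuse)."""
    flops = 2 * n**3                            # N multiply-adds per output element
    bytes_moved = 3 * n**2 * bytes_per_elem     # read A, read B, write C
    return flops / bytes_moved

# Intensity is N/6 FLOPs per byte: doubling N doubles the work done per
# byte moved, which is why large GEMMs become compute-bound.
print(gemm_arithmetic_intensity(1024), gemm_arithmetic_intensity(2048))
```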
2. 2D Memory Representation
Physical RAM is 1D. To represent a 2D tensor, we use strides: the number of elements (or bytes) to skip in the flat buffer to advance one step along each dimension. A common production pitfall is assuming a tensor is contiguous. If you mix up row and column strides in your pointer math, you will read "ghost" data or trigger memory violations.
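NumPy exposes the same stride machinery Triton kernels rely on, so we can demonstrate both the pointer math and the non-contiguity pitfall without a GPU:

```python
import numpy as np

# Row-major (C-contiguous) 3x4 fp32 matrix: element (i, j) lives at
# flat offset i * 4 + j. NumPy reports strides in bytes; divide by the
# element size to get element strides.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
row_stride = a.strides[0] // a.itemsize   # 4 elements to step one row down
col_stride = a.strides[1] // a.itemsize   # 1 element to step one column right

# Manual pointer arithmetic against the flat 1D buffer:
flat = a.ravel()
i, j = 2, 3
assert flat[i * row_stride + j * col_stride] == a[i, j]

# Transposing swaps the strides WITHOUT copying data. The result is no
# longer contiguous, so any kernel that hard-codes stride (cols, 1)
# would silently read the wrong ("ghost") elements.
t = a.T
assert t.strides == (a.strides[1], a.strides[0])
assert not t.flags["C_CONTIGUOUS"]
```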
3. Tiled Generalization
Triton generalizes elementwise logic by shifting from single pointers to blocks of pointers. By using 2D tiles (e.g., $16 \times 16$), we exploit data reuse in high-speed SRAM, keeping data "hot" for fused operations like bias addition or activations before writing back to global memory.
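The tiling pattern can be sketched in plain NumPy, with a small accumulator standing in for on-chip SRAM. This is a conceptual model of the loop structure, not a Triton kernel; the function name and tile size are illustrative:

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 16) -> np.ndarray:
    """Block-tiled matmul. Each (tile x tile) output block is accumulated
    in a small buffer (the stand-in for SRAM) across the K dimension,
    then written back to 'global memory' exactly once.
    Assumes all dimensions are divisible by `tile` for simplicity."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=np.float32)
    for m in range(0, M, tile):
        for n in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=np.float32)  # "on-chip" accumulator
            for k in range(0, K, tile):
                # Each A and B tile is reused across `tile` outputs once loaded.
                acc += A[m:m+tile, k:k+tile] @ B[k:k+tile, n:n+tile]
            # Fusion point: a bias add or activation would apply to `acc`
            # here, while it is still "hot", before the single write-back.
            C[m:m+tile, n:n+tile] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 48)).astype(np.float32)
B = rng.standard_normal((48, 64)).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

The key design point mirrors the Triton mental model: the two outer loops pick a block of pointers (one output tile), and the inner loop streams tiles through fast local storage so each loaded element is reused `tile` times.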